These are computational tasks which involve many separate, independently executable calculations. Some common statistical examples of embarrassingly parallel processes: bootstrapping, K-fold cross validation, and Monte Carlo simulation.
In contrast, some processes are inherently sequential: MCMC sampling, for instance, where each draw depends on the previous one, or iterative optimization, where each step starts from the last.
For loops that do not explicitly involve dependent calculations are wasteful, especially if we have multiple processors available. Perhaps even worse, the time cost of a wasteful approach can put some useful statistical tools beyond our reach!
apply() functions can help, but they still don't use multiple processors. The parallel package (thanks, Miranda!) is one route; here we'll use the foreach package. We would like to make use of our whole computer, and make useful tasks like bootstrapping available to us, without having to invest large amounts of time in learning new programming languages. Enter foreach, which keeps the structure of a for loop, but allows us to drop two key assumptions: that the iterations must run in order, and that they must all run on a single processor.
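As a minimal sketch of what this looks like (a toy loop, not the bikeshare example below), a for loop that fills a vector becomes a foreach expression whose iterations are independent:

```r
library(foreach)

## traditional for loop: writes into a shared vector, in order
squares.for <- numeric(5)
for (i in 1:5) squares.for[i] <- i^2

## foreach equivalent: each iteration is a self-contained expression,
## and .combine = "c" glues the results into a vector
squares.foreach <- foreach(i = 1:5, .combine = "c") %do% {
  i^2
}
```

Here %do% runs sequentially; once a parallel backend is registered, swapping in %dopar% is all it takes to spread the iterations over multiple processors.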
Our goal: transform a traditional for loop into a foreach loop. We will begin with a simple chunk of R code involving a for loop, and transform it. Along the way, we'll take a look at the equivalent computation done with an apply() function, and see that using foreach with multiple processors outperforms this too.
We are going to look at data from the New York City bikeshare program (Citibike).
We want to find a model which offers good prediction. We start with a few plausible models and use K-fold cross validation to decide which one to use.
We now suppose that we're considering three increasingly complex models of the arrival behavior. To compare the three candidate models' prediction error, we'll use K-fold cross validation, with the same folds for all three models.
Here’s some familiar code that sets things up:
source("~/git/fixschewitt/foreach/citibike_new_york/EDA/subsetting_arrivals_attributes.R")
library(splines)  ## for bs()
## Lq loss: |y - y.hat|^q (q = 1 gives absolute error, q = 2 squared error)
lq.loss <- function(y, y.hat, q = 1) {(abs(y - y.hat))^q}
## Fit the three candidate models on the training rows and return
## their mean Lq prediction errors on the held-out test set
get.errs <- function(test.set = NULL,
                     discarded = NULL,
                     q = 1) {
  sml.glm <- glm(arrivals ~
                   bs(hour, degree = 4)
                 + weekend
                 + as.factor(id),
                 data = arrivals.sub[-c(discarded, test.set), ],
                 family = "poisson")
  med.glm <- glm(arrivals ~
                   bs(hour, degree = 4)*weekend
                 + as.factor(id),
                 data = arrivals.sub[-c(discarded, test.set), ],
                 family = "poisson")
  big.glm <- glm(arrivals ~
                   bs(hour, degree = 4)*weekend
                 + bs(hour, degree = 4)*as.factor(id),
                 data = arrivals.sub[-c(discarded, test.set), ],
                 family = "poisson")
  ## column 7 of arrivals.sub holds the response (arrivals)
  sml.err <- mean(lq.loss(predict(object = sml.glm,
                                  newdata = arrivals.sub[test.set, -7],
                                  type = "response"),
                          arrivals.sub[test.set, 7],
                          q = q))
  med.err <- mean(lq.loss(predict(object = med.glm,
                                  newdata = arrivals.sub[test.set, -7],
                                  type = "response"),
                          arrivals.sub[test.set, 7],
                          q = q))
  big.err <- mean(lq.loss(predict(object = big.glm,
                                  newdata = arrivals.sub[test.set, -7],
                                  type = "response"),
                          arrivals.sub[test.set, 7],
                          q = q))
  return(c(sml.err, med.err, big.err))
}
Next, we make our K-fold test sets (and implicitly, our training sets):
K <- 50
N <- dim(arrivals.sub)[1]
## kill off 8 observations and make cv test sets
set.seed(1985)
discarded <- sample(1:N, size = 8)
cv.test.sets <- matrix(sample((1:N)[-discarded], size = N - 8), ncol = K)
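Note what this produces (a sketch with a hypothetical N, since the real N depends on the sourced data): after discarding 8 rows, the remaining row indices are shuffled into a matrix whose K columns are disjoint test folds.

```r
## toy illustration with a made-up N that divides evenly into K folds
N.toy <- 508; K.toy <- 50
set.seed(1985)
discard.toy <- sample(1:N.toy, size = 8)
folds.toy <- matrix(sample((1:N.toy)[-discard.toy], size = N.toy - 8),
                    ncol = K.toy)
dim(folds.toy)  ## 10 rows by 50 columns: each column indexes one test fold
```

Since the columns partition the kept rows, each observation appears in exactly one test fold, and the training set for fold i is everything outside column i (and outside the discarded rows).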
Using a naive for loop, we could implement this as:
err.for <- NULL
system.time(
for (i in 1:K) {
err.for <- cbind(err.for, get.errs(test.set = cv.test.sets[, i],
discarded = discarded,
q = 1))
}
)
## user system elapsed
## 17.051 0.925 18.118
If you’re good with apply() functions you might upgrade to
## apply version
system.time(
err.apply <- sapply(X = 1:K,
FUN = function(i) {
get.errs(test.set = cv.test.sets[, i],
discarded = discarded,
q = 1)
}
)
)
## user system elapsed
## 16.126 0.956 17.129
Neither of the first two methods takes advantage of multiple processors. While apply() functions avoid the inherently sluggish nature of for loops in R, they are still ignorant of the processor structure. We want to chop the job into halves, fourths, etc., and use the whole computer!
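How many processors do we have to chop the job across? The parallel package (shipped with R) can tell us; the count is of course machine-dependent:

```r
library(parallel)
detectCores()  ## number of logical cores available on this machine (varies)
```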
Here is the same computation written with a foreach() loop:
## foreach version
library(foreach)
library(doParallel)
registerDoParallel(cl = 4)
system.time(
err.foreach <- foreach(i=1:K,
.inorder = FALSE,
.combine = "cbind") %dopar% {
get.errs(test.set = cv.test.sets[, i],
discarded = discarded,
q = 1)
}
)
## user system elapsed
## 22.395 1.585 9.731
A few notes on the pieces:
- %do% performs the calculations in order, and so uses only one processor; %dopar% is what we generally wish to use.
- .combine can take on the intuitive values "c" or "cbind", as well as more complex functions. The default is to return a list.
- .inorder is a TRUE/FALSE argument. Setting it to FALSE lets %dopar% combine results in whatever order they finish, which can be faster when the order doesn't matter.
- Unlike apply() functions, foreach takes an expression instead of a function.
Sometimes the list or vector that we are iterating over (in the above case, the vector 1:K) can be a very large object. In our case the vector is quite reasonable in size, but suppose we were iterating over a multi-dimensional object with many values and a high level of precision. Then we'd be storing a massive object, which could potentially fill up our usable memory and slow things down. We could save memory by only keeping the piece of our list we need for a given calculation, and dumping the rest. This is the idea behind the iterators package in R.
Here is our same example with an iterator instead of a list.
library(iterators)
registerDoParallel(cl = 4)
system.time(
err.foreach.iter <- foreach(x = iter(cv.test.sets, by = "col"),
.inorder = FALSE,
.combine = "cbind") %dopar% {
get.errs(test.set = x,
discarded = discarded,
q = 1)
}
)
## user system elapsed
## 25.115 1.119 10.176
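To see what iter() is handing to each iteration, here is a small standalone illustration: an iterator over a matrix by column yields one column at a time, rather than materializing all the pieces up front.

```r
library(iterators)
m <- matrix(1:6, ncol = 3)
it <- iter(m, by = "col")
nextElem(it)  ## a 2 x 1 matrix holding the first column (1, 2)
nextElem(it)  ## the second column (3, 4)
```

Inside a foreach loop we never call nextElem() ourselves; foreach pulls elements from the iterator as workers become free.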
When parallelizing a process which generates random numbers we need to be careful, since foreach makes no guarantees about the random number streams on the workers: the streams may overlap or fail to be independent, so the combined draws may not have the properties we expect. Moreover, if we want to be able to reproduce an analysis which uses a random number generator, the usual set.seed() isn't guaranteed to work.
Fortunately, there is doRNG. There are many ways to implement this package, the two easiest of which are:
library(doRNG)
registerDoParallel(cl = 4)
blah1 <- foreach(x = 1:10,
.options.RNG = 1985,
.combine = 'c') %dorng% {
rnorm(1)
}
and
registerDoParallel(cl = 4)
registerDoRNG(seed = 1985)
blah2 <- foreach(x = 1:10,
.combine = 'c') %dopar% {
rnorm(1)
}
Note that this gives reproducible results!
sum(blah1 != blah2)
## [1] 0
Some packages come with parallel functionality built in via foreach, including glmnet, gam, ggmcmc, and plyr.
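For example, glmnet's cross-validation function can farm its folds out to whatever backend you have registered (a sketch on simulated data; cv.glmnet's parallel argument dispatches the folds via foreach's %dopar% internally):

```r
library(glmnet)
library(doParallel)
registerDoParallel(4)

set.seed(1985)
x <- matrix(rnorm(200 * 20), ncol = 20)  ## simulated predictors
y <- rnorm(200)                          ## simulated response
cv.fit <- cv.glmnet(x, y, parallel = TRUE)  ## folds run in parallel
```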